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Abstract —We propose a computational model for shape, illumination and albedo inference in a pulsed time-of-flight (TOF) camera. In 
contrast to TOF cameras based on phase modulation, our camera enables general exposure profiles. This results in added flexibility and 
requires novel computational approaches. To address this challenge we propose a generative probabilistic model that accurately relates 
latent imaging conditions to observed camera responses. While principled, realtime inference in the model turns out to be infeasible, 
and we propose to employ efficient non-parametric regression trees to approximate the model outputs. As a result we are able to 
provide, for each pixel, at video frame rate, estimates and uncertainty for depth, effective albedo, and ambient light intensity. These 
results we present are state-of-the-art in depth imaging. The flexibility of our approach allows us to easily enrich our generative model. 
We demonstrate that by extending the original single-path model to a two-path model, capable of describing some multipath effects. 
The new model is seamlessly integrated in the system at no additional computational cost. Our work also addresses the important 
question of optimal exposure design in pulsed TOF systems. Finally, for benchmark purposes and to obtain realistic empirical priors of 
multipath and insights into this phenomena, we propose a physically accurate simulation of multipath phenomena. 

Index Terms —Time-of-flight, Bayes, depth cameras, intrinsic images, multipath 
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1 Introduction 

The commercial success of depth cameras in recent years 
has enabled numerous computer vision applications. 
Notable applications are human pose estimation |l| 2|, 
dense online 3D reconstruction of an environment EL 
and other uses—an overview is available in a recent 
special issue 2) and in the review article f5l . 

Broadly speaking we may differentiate between depth 
cameras based on triangulation and cameras which es¬ 
timate depth based on time of flight (TOF) |6, 7j. Fur¬ 
thermore, while in the context of TOF the cameras often 
operate using modulated illumination and sensing, and 
the computational methods usually employ phase-space 
reasoning EL in this paper we take a different approach 
which we now describe. 

Figure [T| describes the inputs and outputs of our 
system. We start with n concurrently captured intensity 
images obtained under active illumination of the scene, 
using n different exposure profiles. Using these n obser¬ 
vations at every pixel, we infer the depth, reflectivity, 
and ambient lighting conditions. We achieve this by 
using a generative probabilistic model that relates the 
unknown imaging conditions— shape , illumination and 
albedo —to the per-pixel camera observations. To perform 
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depth illumination albedo 


Fig. 1: System overview: the inputs are n pulsed TOF response im¬ 
ages, obtained concurrently using different exposure profiles. In realtime 
(30fps) we separate depth, ambient illumination and effective albedo at 
every pixel. 

inference we use either Bayesian inference or maximum 
likelihood estimation. 

However, achieving realtime video rate by direct ap¬ 
plication of these inference methods is infeasible under 
practical constraints on computation. Therefore we use 
an approach inspired by model compression E) and 
approximate the accurate but slow inference methods 
using regression trees, a fast non-parametric regression 
method 0. 

The regression approach has two advantages; first. 
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it allows to approximate inference in principled prob¬ 
abilistic models under tight compute and memory con¬ 
straints; second, it decouples the model from the runtime 
implementation, allowing continuing improvements in 
the model without requiring changes to the test-time 
implementation. 

We demonstrate this important advantage in Section 
[5] where we extend our generative model to a richer 
model which considers multipath effects. Our decou¬ 
pling of model+inference from runtime regression al¬ 
lows us seamless switching between different generative 
models, at no additional computational cost. To the best 
of our knowledge no other depth cameras have used a 
statistical regression approach for online depth inference. 

No matter which model we use for inference, at times 
there will be pixels that the model fails to explain. 
Common reasons are mixed pixels (imaging a depth 
discontinuity), sensor saturation, complex multipath, in¬ 
terference from another active device, or extreme image 
noise. We propose a robust fit-to-model score that can be 
used to detect and invalidate affected pixels from further 
processing. 

Having described inference and pixel invalidation, we 
address an orthogonal but important question in pulsed 
TOF systems: exposure profile design. We are flexible 
to choose exposure profiles and we directly optimize 
the expected accuracy of inferred depth using Bayesian 
decision theory. This yields a challenging optimization 
problem and we propose an approximate solution. 

Finally, we introduce an accurate TOF simulation pro¬ 
cedure based on physically accurate light transport sim¬ 
ulation. We use this capability for both exposure design 
and for synthetic but physically accurate benchmarking. 

1.1 Related Work 

Most commercially available time-of-flight cameras (as 
of early 2015) work using modulated time of flight [10, Ull , 
also known as phase-based time-of-flight. They generate 
a sinusoidal illumination signal and measure correlation 
of the reflected signal with a sinusoidal-profiled gain 
function of the same frequency, delayed by a phase 
shift Q2) For a fixed frequency and phase shift a 
recorded frame does not contain sufficient information to 
reconstruct depth. Therefore, modern systems typically 
record a sequence of frames at multiple frequencies 
and multiple phase shifts and use the combined set of 
frames to unambiguously infer depth using so called 
phase unwrapping algorithms 1711121. 

In contrast, our camera uses pulsed TOF, also known as 
gated TOF. This technology has differentiators in terms of 
hardware-related aspects (size, power, resolution) which 
are not relevant here, but let us highlight an important 
computational aspect of this camera: in contrast with the 
sine-like gain functions used in modulated TOF, we are 
allowed to choose from a large space of possible gain 
functions. Hence more general inference methods are 
required. 


Our work on optimizing the gain profiles has not been 
addressed in pulsed TOF systems, but for modulated 
TOF prior work (13]| has attempted to optimize the 
illumination profile to improve depth accuracy. 

Shape, Illumination, and Reflectance. Recovering the 
imaging conditions leading to a specific image—the 
inverse problem of imaging, is a long standing goal 
of computer vision. A recent modern treatment of this 
problem has been given in [14, [I5l HE T/\, with a 
comprehensive historical review. Conceptually the ap¬ 
proach in these works is similar to ours: find the most 
likely shape, illumination and albedo to give rise to the 
observed image. In contrast with these works, we do 
the inference at the pixel level and not the image level, 
being able to do so due to the unique imaging process 
we employ. Additionally our regression approach allows 
this inference to be done in realtime. Moreover, our 
shape output actually gives the full posterior depth 
distribution. This allows direct usage of our depth in 
incremental estimators or integrators such as 0 that 
specifically take care to maintain the state distribution 

at all times mum 

In the context of illumination estimation, we remark 
that there have been specific works on shadow re¬ 
moval EH EEL which is a nice byproduct of our approach 
(see Figure [TJ. 

Multipath Interference. Multiple reflections (multi- 
path) commonly occur in real scenes (6j. There is now a 
solid body of work on handling multipath in modulated 
TOF systems, but to the best of our knowledge there is 
no published work on handling multipath in pulsed TOF 
cameras. 

We briefly discuss work that exists for modulated TOF 
and relate it to our proposed solution. The work of [22] 
[23] and Il24l model the light reflections in the scene glob¬ 
ally to improve depth inference. To do this, they assume 
planar Lambertian surfaces and iteratively minimize an 
energy function. The methods work in important settings 
but the expensive minimization procedure precludes a 
realtime implementation. The work [25 , 26 , 07 \ assumes 
two-path interference from close-to-specular surfaces. 
The resulting methods are practical and efficient and our 
approach in Section [5] makes similar model assumptions. 
However, we work with different signals (pulsed TOF) 
and also provide a probabilistic model with uncertainty 
estimates. The work l28l generalizes the above two- 
path models to signals which arise from either two-path 
specular or two-path Lambertian reflectors; these signals 
are "compressible" and can hence be described with few 
parameters; the resulting method can be implemented in 
real time. Another recent branch of literature originating 
from work on transient imaging l29l uses modulated 
TOF imaging with Fourier-based reconstruction of the 
time-dependent light density. The work of [30] [31] re¬ 
constructs the transient light density for each pixel from 
a large number of modulated TOF images, each with 
a different modulation frequency. While this line of 
work could inspire practical multipath techniques and is 
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computationally efficient, currently the large number of 
required frequencies (several dozen) and the large acqui¬ 
sition time precludes realtime applications in dynamic 
scenes. 

The robust invalidation of observations seems to have 
not been considered before with the exception of |28l 
who provide an adhoc method for invalidation. Because 
we use a sound probabilistic model we can leverage and 
adapt standard methods in Bayesian practice |32| for this 
purpose. 

1.2 Contributions 

To summarize and as an aide in following the paper, our 
novel contributions are: 

Principled Framework 

- A probabilistic generative model for pulsed TOF 
imaging; 

- Principled inference of all latent imaging conditions, 
given camera observations 

- Accurate depth uncertainty estimates; 

- Robust Bayesian per-pixel invalidation for outlier 
observations; 

Practicalities 

- A novel use of regression to enable realtime inference 
under tight compute and memory constraints; 

- Complete decoupling of runtime mechanism from 
model and inference 

Extensibility and Multipath 

- A probabilistic model for depth inference in the 
presence of simple multi-path; 

Results 

- Experimental results showing robust video-rate in¬ 
ference of shape, illumination and reflectance, both in¬ 
doors and outdoors at direct sunlight; 

Computational Photography and Tools 

- Design of exposure profiles to directly optimize 
depth accuracy under task-derived imaging conditions; 

- A novel physically-based renderer for TOF simula¬ 
tion 

- Use of the simulator for both exposure design and 
benchmarking 

2 Modeling the Imaging Process 

We start with our camera's principle of operation, then 
formulate a generative model relating the unknown 
imaging conditions to the observable camera outputs. 
Assume that a specific pixel images a point at a certain 
distance and denote by t the time it takes light to travel 
twice this distance. The reflected signal is integrated at 
the pixel using a gain determined by a shutter signal 
£(•). If P(-) is the emitted light pulse, the reflected pulse 
arriving after time t is P(u — t). The observed response 
due to the reflected light pulse is 

-^active / S{u) p P(u — t) d(t) du. (1) 


Here p denotes the effective reflectivit^of the imaged 
point, and d(t) = ^ denotes decay of the reflected pulse 
due to distance. Therefore, the reflected pulse is down- 
scaled by a factor of pd(t). The quantity pP{u — t)d(t) is 
integrated with an exposure-determined gain £(•). 

Let us now consider the effect of ambient illumination. 
We denote by A the ambient light level falling on the 
imaged point. Then the reflected light level is pA, and 
we assume that during the integration period, this level 
of ambient light is constant. Therefore, the observed 
response due to ambient light is iZambient = f S(u) pXdu. 
The actual observed response is the sum of the responses 
due to active illumination and due to ambient light, 

R = J S(u) (,p P{u - t) d(t) + p A) du. (2) 

Equation (J2J specifies the relationship between the 
unknown imaging conditions (£, p, A) (depth, albedo, 
and ambient light level), and the observation we ob¬ 
tain at the pixel, when using the exposure profile 
£(•). We concurrently use n different exposure profile^] 
Si(-), S^G)? • • • ^ S'nG), and obtain n observations as 


Ri ' 


f Si(u)P(u — t)d(t) du 


f Si(u) du 


= P 


+p\ 


Rn 


f S n (u)P(u — t)d(t) du 


_f S„(u) d u_ 


— pC(t ) + pXA. (3) 


In short, we have the observed response vector 

R = pC(t ) + pXA. (4) 

Here C(t) is the expected response from a point at 
distance equivalent to time t, assuming unit reflectivity 
and no ambient light. This response is scaled by the 
reflectivity p and shifted in the ambient light direction 
A, the magnitude of the shift being the product of 
albedo and ambient light level. Equation Q is the model 
describing our imaging process. 

We remark that C(-) and A are determined by the illu¬ 
mination and exposure signals and are estimated using 
a simple camera calibration process which is outside the 
scope of this paper. 

Figure [ 2 ] shows the curve C(-) as a function of depth 
t, for a typical exposure profile design. The four colored 
curves denote the specific response curves of four expo¬ 
sure profiles Si(-),..., S 4 G), namely 

Ci(t ) = J Si(u ) P(u — t ) d(t) du. (5) 

Looking at Figure [ 2 } consider the response vector R we 
may expect from depth t = 150cm. We see that the first 
(blue) coordinate should be high, the second and fourth 

1. We use both the terms albedo and reflectivity. The quantity p we 
use in the model actually contains the effect of foreshortening and 
therefore we refer to effective reflectivity/albedo. 

2. The hardware s yste m enabling concurrent different exposure pro¬ 
files is described in | 33) and 15411351 . 
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(a) Response curve C 


(b) C/d, without decay 


Fig. 2: A typical response curve, (a) 


The actual curve C(-). As dis¬ 


tance grows the response decays, as per equation (5^. [(b)] Decay 
compensated response where we plot C(t)/d(t) = v*C(t) for more 
details (from now on we use decay-compensated curves for visualiza¬ 
tion). 


coordinates should be approximately equal and the third 
coordinate (red) should be the lowest. In contrast, at 
depth t = 190cm the first and second entries of R should 
be approximately equal (blue and green). Thus we see 
that by suitable design of the curve (?(•), we may expect 
to be able to infer depth accurately using the responses 
we observe. 

Since the response we observe is scaled by the albedo 
p, it may be tempting to normalize the response vector. 
However, as we discuss in the next section, the noise 
does depend on the magnitude and therefore the un¬ 
normalized response contains relevant information for 
depth inference and for predicting depth uncertainty that 
would be lost upon normalization. 


3 A Probabilistic Model 

We now rephrase 0 as a probabilistic model, relating 
the imaging conditions (£,p, A) to a distribution over 
responses R. Specifically we model R given t, p, A as 

R~P(R\t,p,\), (6) 

where we assume that P(R \ t, p, A) is a multivariate 
Gaussian distribution with mean vector as in UK 

E [R | t, p, A] = /2 (t, p, X) = p C(t) + p XA, (7) 


and with a diagonal covariance matrix 


£(M) 


( otpi + K 


\ 

U/i n H~ K J 


( 8 ) 


Here K is related to read noise —noise that is part of 
the system even when no light is present. The affine 
relationship between the magnitude of the response and 
its variance is due to shot noise and is well known 
[36, 371. We validate this noise model experimentally in 
the supplementaries. 

The generative model 0 is the distribution of the 
observed R at a pixel given the imaging conditions. 
We would like to infer the imaging conditions depth t, 
reflectivity p and ambient light level A given the obser¬ 
vation R. There are three mainstream approaches for do¬ 
ing so, namely Bayesian posterior inference, maximum 
likelihood estimation (MLE), and maximum aposteriori 
(MAP) estimation. 


3.1 Bayesian Inference 

We assume certain priors on depth, reflectivity and 
ambient light level, denoted by p(t), p(p), and p( A). In 
addition we assume independence between these fac¬ 
tors. Let us focus on inferring depth t, the most relevant 
unknown for depth cameras. Bayes rule gives 

P(t | R) oc P(R | t) p(t) 

= pit) JJ P(R\t,p,\)p(p)p(\)dpdX. (9) 

Equation (9} gives the posterior distribution over the 
true unknown depth. We get the posterior density up 
to a normalization factor which may be extracted by 
integrating over every possible t. The posterior density 
is the ideal input to higher level applications which 
use probabilistic models IfTHl H9l . For other applications, 
it may be sufficient to summarize this posterior dis¬ 
tribution by a point estimate, for example the poste¬ 
rior mean t^ ayes (R) = E [t \ R] or the MAP depth 
tmap{R) = argmax t P(t\R), together with a measure of 
the dispersion such as the posterior variance. 

Computationally, we have to solve the integration 
problem 0 at every pixel. Doing this at frame rate under 
low compute resources is currently not feasible. 

A second issue with 0 is that it requires the specifi¬ 
cation of priors p(t), p(p), and p( A). While using uniform 
priors on depth and reflectivity is physically plausible, 
specifying the prior on ambient light level is harder. For 
example, operating the camera in a dark room versus a 
sunlit room, would require very different priors. If the 
used prior deviates too much from the actual situation 
our estimates of depth could be biased, that is, suffer 
from systematic errors. 

3.2 Maximum Likelihood Inference (MLE) 

Alternatively we use maximum likelihood estimation for 
the imaging conditions, 

(We, Pm\e, Vie) = argmax P(R I t, p, A). (10) 

t,p, A 

Instead of considering the depth that accumulates the 
most probability over all reflectivity and ambient light 
explanations, we consider the single combined imaging 
conditions (W e , p m i e , A m i e ) which have the highest prob¬ 
ability of producing the observed response R. 

3.3 Maximum Aposteriori Inference (MAP) 

This method is the most likely point estimate taking into 
account prior preferences. We obtain it similar to the 
MLE estimate as 

((map, Ana p, V ap ) = argmax p{t) p{p) pi A) P(R | t, p, A). 
t,p, A 

(ii) 

The optimization problems |lQ| ) and 0 are non¬ 
linear because /!(•) is non-linear and because our noise 
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model ([ 8 } has a signal-dependent variance. An iterative 
numerical optimization is required and a frame rate 
solution at every pixel is infeasible. We discuss further 
details of the inference procedures for MLE, MAP, and 
Bayesian inference in the supplementary materials. 


4 A Regression Tree Approach 

All inference methods we propose, MLE t m \ e , MAP £ map , 
and Bayesian inference £ Bayes produce reliable depth 
estimates. However the computation of these estimates 
is expensive and impractical for a realtime camera sys¬ 
tem. To perform realtime inference we use a regression 
approach to approximate the model as follows. 

1 ) Offline: Sample imaging conditions pi, A i) from 
the prior and responses Ri from the model { 6 }. 
Then use one of the slow inference methods to 
generate labeled training data (RiR(Ri)). 

2) Offline: Train a regression tree/forest using the 
training data set, to obtain a predictor £rf. 

3) Online: Given an observed response R predict the 
inferred depth f RF {R)- 

Why would this procedure be a good idea? 

- First, t m i e , £ map , and £ Bayes are smooth functions from 
the response space to depth and are simple to learn. 

- Second, the regression tree £rf has small performance 
requirements in terms of memory and computation. 

- Third, it decouples the runtime from future changes 
to the probabilistic model and inference procedures. 

In principle it would be desirable to train directly on a 
large and diverse corpus of ground truth data captured 
from the real world; however, capturing ground truth 
depth data is challenging (38), expensive, and ensuring 
the diversity in imaging conditions is difficult. Training 
on our forward model instead allows us to represent a 
wide variety of imaging conditions. Likewise, while we 
could train directly on samples (RiRi) from the model 
this would incur additional variance because the noise 
makes R stochastic even for a fixed depth value. By 
training on the estimator (R,ti) instead we effectively 
remove this variance from the regression task. 

For learning the regression tree we use the standard 
CART sum-of-variances criterion in a greedy depth first 
manner ID For the interior nodes of the trees we per¬ 
form binary comparisons on the individual responses, 
Ri < a. At each leaf node b we store a linear regression 
model, 

t b {R) = el ■ [l,R 1 ,...,R n ,R 2 ,R 1 R 2 ,...,R 2 n ] T , (12) 

where we use a quadratic expansion of the responses. We 
estimate the parameters 0^ of each leaf model using least 
squares on all training samples that reach this leaf If39l . 

We cannot over emphasize the practical importance of 
a flexible and decoupled-from-model regression scheme, 
in handling unexpected or new phenomena. An example 
is detailed in the supplementaries. 


Approximation Tradeoffs. Because of its non- 
parametric nature the regression tree or forest can ap¬ 
proach the quality of the exact inference output if given 
sufficient training data and expressive power. However, 
the key limiting factor in our actual implementation are 
specific constraints on available memory and compute. 
Basically the depth of the tree and the structure of the 
leaf predictor, determine the memory requirements. In 
Section [9] we example the accuracy vs memory tradeoffs 
experimentally. 


4.1 Additional Regression Outputs 


In addition to the estimated depth we output several 
other quantities per pixel. These outputs too are pro¬ 
duced using trained regression trees. Specifically, we 
produce the following additional outputs: 

- Reflectivity, p, using E[p|/£] or via (10>, (111. 


Ambient light level. A, using E[A|ii] or via (10 1 , (111. 


- Depth uncertainty, as described below. 

- Fit-to-model invalidation score 7 , for detection of 
irregular imaging conditions, described in Section [ 6 ] 


4.2 Computing Depth Uncertainty 


In many applications of depth cameras to computer 
vision problems the estimated depth is used as part 
of a larger system; in these applications it is useful 
to know the uncertainty of the depth estimate. One 
example would be surface reconstruction 0 , where 
uncertainty can be used to weight individual estimates 
and to average them over time. 

We use the variance of the depth, in the form of 
a standard deviation a(R), as a measure of uncertainty. 
Depending on whether we use £ Bayes or t m \ e , we compute 
the standard deviation as follows. 

For t Bayes we use the posterior distribution 0, and 
directly compute a Baye s{R) = Jv t ~p(t\ii) M- 


For t m \e in (10 1 , we employ the approach described 
in [40l. A first order Taylor expansion of the gradi¬ 
ent (wrt imaging conditions) of the likelihood function 
in (101 is used to relate a perturbation A R in the response 
to the resulting perturbation of the estimator t m \ e (R + 
A R). This analysis leads to the covariance matrix of the 
maximum likelihood estimator and an approximation to 

the standard deviation, a m \ e (R) = y V[£ m i e ]. 

In Section [9] and Figure 8 (b)| we demonstrate the ac¬ 
curacy of our uncertainty estimates by comparing them 
with the actual observed uncertainty. In the context 
of phase-based TOF, previous work [41] used random 
forests to output depth confidence scores for measured 
phase signals; their regressor was trained using laser 
scans. Here we instead obtain uncertainty directly from 
our probabilistic model. 
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Fig. 3: Two-path model priors for the additional latent variables t 2 and 
P 2 - Left: The prior for the second bounce depth t 2 is uniform over the 
shaded polygon. Right: the prior for p 2 is defined over [0; 2] in order to 
handle large diffuse reflectors. 


5 Two-Path Model for Simple Multipath 

The generative model Q we used so far assumed a 
single direct response from the point being imaged. In 
order to account for multipath, this model needs to be 
extended as to describe the additional multipath light 
being integrated at the pixel. We demonstrate a simple 
extension of the model and inference as follows. 

Consider a two-path model as proposed in (25J [26J 
281. In addition to the three unknowns £, p, and A, we 
now also assume a second contribution having travelled 
depth £ 2 > t, from a patch with reflectivity p 2 . We extend 
the generative model ^ to 

P2{t, p, A, £25 P2) = P (c(t) + + p2C(t2)^ , ( 13 ) 

where p scales both the direct and indirect response, 
and p2 scales only the indirect response. The model is 
exact for a second specular surface, but becomes an 
approximation in case the second surface is diffuse. 
For inference we extend the inference procedures to 
this model in a straightforward manner (details in the 
supplementaries). 

For the prior of £2 we select a uniform prior relative 
to £, such that £ 2 — £ is uniform between 0cm and A 
(typically A = 150cm), that is, 

p(t 2 \t) = U(t 2 ]t,t + A). (14) 

For the second reflectivity we allow p 2 > 1 to be able to 
handle diffuse surfaces, and after studies of simulation 
data select a Beta distribution on the interval [ 0 ; 2 ]. 

p(p 2 ) = B(p 2 / 2; a =» 1, /? = 5). (15) 

This prior specifies that values up to p 2 = 2 are possible, 
but that low values of p2 are more likely. Both priors are 
visualized in Figure [3] and we will evaluate this model 
on real and simulated data in the experiments section. 

It is important to emphasize that our regression- 
decoupled-from-model approach allows us to seamlessly 
use this extended model in the camera, just by plugging 
it in the offline step 1 of the procedure outlined at the 
beginning of Section [4] The runtime process and its 
computational cost do not change at all. 


6 Bayesian Model Invalidation Score 7 


Our imaging model is an idealization of the real world 
and in each frame a certain number of pixels will have 
measured responses R which do not conform to this 
model. The main reasons for this are systematic errors 
such as multipath | 22 , 28j, pixels of mixed depth, sensor 
saturation, as well as statistical extremes in imaging 
noise. In Section [5] we extended our model to explain 
some multipath effects, but even this extended model 
may fail to explain some of the responses. 

When our model fails to accurately explain the ob¬ 
served response vector R we would like to detect such 
a deviation from the model assumptions and exclude 
affected observations from further processing. The strict 
Bayesian paradigm cannot detect deviation from model 
assumptions because it only provides the calculus to 
go from assumptions and observations to conclusions 
and no mechanism to falsify the assumptions them¬ 
selves [42) . However, in Bayesian modelling practice l43l 
a common method to assess deviations from model 
assumptions is to perform so called posterior predictive 
checks. 

We use the posterior predictive p-value [32} |44> 451 
for our purposes. Intuitively our particular p-value will 
measure the total probability mass of all observations 
which have a smaller likelihood than the likelihood of 
the observed response. Therefore the score is always 
between zero and one and a value close to zero indi¬ 
cates that the observation is unlikely under the assumed 
model. This intuition is helpful but the controversy 
around p-values and model checking more generally is 
deep and we give a brief discussion in the supplemen¬ 
tary materials. 

To formalize this problem, let us first unify notation 
by writing 0 = (£,p, A) or 0 = (£, p, A, £ 2 , P 2 ), depending 
on whether we use the single path model (0 or the two 
path model ( pB| , so that 0 are all the unknown imaging 
conditions to be inferred. Given an observed response 
vector R and using the model P(R\0) and the prior 
P{6) we can use Bayesian inference to infer the posterior 
distribution P(6\R). Following the above intuition the 
invalidation score 7 is defined as 


7 (R) = E, 


Q~P(Q\R) 


E 


'R'~P(R'\6) 


{P(R'\6)<P(R\0)} 


(16) 


Here we used the notation 1 {predicate} which evaluates 
to one if the predicate is true and to zero otherwise. 
The above equation integrates all probability mass of less 
likely observations, weighted by the posterior P(6\R). If 
we have many repetitions of the experiment 6 ~ P{6) 
and R ~ P(R\6) the scores 7 (R) would be uniformly 
distributed. The computation of ( p~ 6 | is essentially free 
during our approximate Bayesian inference procedure. 

The value ^(R) can be used to reject the null hypothe¬ 
sis of the assumed model: if 'y(R) < r for some threshold 
r we reject the assumed model for this observation. The 
score ( p~ 6 | > is also applicable to MLE and MAP inference if 
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(b) 20,000 (c) 40,000 (d) 100,000 


Fig. 4: Exposure profile optimization. Top: Simulated annealing over 
100 k iterations, finding response curves to minimize (T9), the expected 
error (MSE) in depth estimation. Bottom: snapshotsof the response 
curves after 20k, 40k, and after all 100k iterations. The x-axis is depth 




100 200 300 400 100 200 300 400 

(a) (b) 

Fig. 6: Left, [(a)! ba sis functions {Qj/d} for a fixed delay S and va rying 
width w. RigrTq(b)| all basis functions {Qj/d} jeJ , defined by Eq. jl7) . 

we replace the outer expectation by a unit point mass at 
the inferred imaging conditions 0 m \ e or 9 map , respectively. 
We evaluate the invalidation score experimentally in 
Section 19.71 

7 Exposure Profiles Design 

So far we covered our imag¬ 
ing model, our realtime re¬ 
gression approach and how 
we can invalidate responses 
unlikely to have been gen¬ 
erated by our model. We 
now turn to an orthogo¬ 
nal question of designing a 
suitable response curve C for use in |I]) (A is closely 
related to C and are both derived from Z which will 
immediately be defined). Recall from that C is the 
integral of the illumination pulse P with the exposure 
profile S. In the camera, a laser diode and driver produce 
the illumination pulse P, and its design is fixed. The 
exposure gain S, however, has a flexible design space 
parameterized by linear basis functions. We would like 
to design response curves C that will produce obser¬ 
vations from which low-error estimates of the imaging 
conditions could be inferred. 

The camera is able to use basic exposure gain profiles 
in the form of a boxcar function, as shown in Fig. [5] 
Each basic gain profile has two parameters: a delay o, 
and a width w. Each possible pair j = ( S , w) specifies 
one possible gain profile B 3 from a fixed discrete set of 
choices J. Typically the set J contains several hundred 
possible combinations. With (|5| we now get the basis 


function Qj associated to P 7 as convolution with the 
pulse, 

Qj(t) — j Bj(u) P(u — t) d(t) d u. (17) 

Fig. [ 6 ] shows a set of basis functions for all possible 
j G J. vVe represent the basis response curves as vectors 
Qj G M T , for a time discretization with T values. By 
stacking the m = \J\ vectors horizontally we obtain 
a matrix Q G M Txm containing all possible basis re¬ 
sponse curves. The possible design space represents 
each curve as linear combination or basis curves, that is 
S(-) = ^2 ZjBj(’). With p| we then obtain the combined 
response curve Cf) = YfzjQj(-)- To design not just one 
but n response curves £*(•) for i = 1 ,..., n, we represent 
the design space using a matrix Z G M mxn as 

C = Q Z, (18) 

where in C G M Txn the k 'th column contains the 
response for the k 'th exposure sequence. 

For the design objective we utilize statistical decision 
theory (46l to select Z to optimize the expected quality of 
depth inference. There are two components to this idea: 
the quality measure, and the expectation. The quality of 
depth inference is measured by means of a loss function 
which compares an estimated depth i with a known 
ground truth depth t to yield a quality score £(t, t). One 
possible loss function which we use is the squared error, 
£(t,t) = (t — t) 2 , but we can also use other functions, 
for example £t(t,t) = £(t,t)/t. For the expectation, as for 
the Bayesian depth inference, we devise priors, typically 
uniform, p(t), p(p), and p( A) over the unknowns. Then 
the design problem is 


argmin 

z 

^t,p,X ^R~P(R\t,p,\,Z) [^(£(-R)>£)] 

(19) 

sb.t. 

m n 

^ ^ ^ ^ Zji ^ -^shutter? 

3=1 i =1 

(20) 


m 

^ ^ 1 -{Zji>0} ^ -^sparsity, Z = 1, . . , , 71, 

n — I 

(21) 


J - X 

Zji G N, j — 1, . . . , 777, i =*■ 1, . . . , 71, 



where the notation l{ pre d} evaluates to one if the pred¬ 
icate is true and to zero otherwise. 

The constraints ( [2Q| and (2l) deserve some comments. 
Each captured frame contains a fixed number iT s hutter of 
light pulses, each of which is associated with a basic 
exposure signal Bj. These are assigned in integer units. 
The total number of basis functions that can be used 
is constrained by iT sparsit y due to various shutter driver 
restrictions. Because in each pulse a single basis function 
is selected, this makes the effective response curve C a 
non-negative linear combination of the basis functions. 

Solving p| is a challenging combinatorial problem 
on three levels: first, computing t(R) is the inference 
problem, which has no closed form solution. Second, as 
a result, computing the expectations also has no closed 
form solution. Third, more than just merely evaluating 
it, we would like to optimize the objective function over 
Z. 

The approximate solution which we adopt is as fol¬ 
lows (more details in the supplementary materials). We 


width w 

delay 5 

Fig. 5: A basis element j = 
( 5 , w) defining Bj(-). 
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(a) Scene (b) Ground truth depth (c) Multipath ratio (d) Pixel A (e) Pixel B (f) Pixel C 

Fig. 7: Insights into multipath using physically accurate light transport simulation, [(a)] A scene create d in B lender, [(b)] Ground truth depth, [(c)] 
Normalized measure of multipath intensity compared to direct contributions, (please see the main text) . I(d)|(f71 Normalized light densities for three 
selected pixels; pixel A has no multipath component, pixel B has one multipath component from 30-50cm further away, and pixel C, where a 
specular component gives rise to a narrow multipath response. 


approximate the objective function by a Monte Carlo 
evaluation for both expectations (imaging conditions, 
and responses): for i ss 1 ,..., K we draw ti, pi, A i, then 
draw Ri, then perform inference to obtain U = t(Ri ) 
and evaluate £ t = l(ti,ti). Finally we approximate the 
objective (191 as empirical mean Ylf=i F° r K = 8192 
samples this computation takes around one second. For 
optimization of ( [T9) we use simulated annealing (47l 
on a custom-designed Markov chain which respects the 
structure induced by ( |20) and ( [21) . 

Figure [4] shows the progress of the optimization pro¬ 
cess. We start the optimization at a completely closed 
exposure profile with zero gain, that is Z 3i — 0 for all j, 


a pixel by means of approximating an integral over light 
pathes connecting light sources to surfaces to camera 
pixels 1511 . 

Our modification is to record for each pixel a weighted 
set of light path samples {(wi, Li , ti )} i= typically a 
few thousand, say N = 4096. For each light path we store 
the intensity weight > 0 , the number of straight path 
segments Li e N, and the total length of the path L. The 
segment count allows us to distinguish direct responses 
(Li = 2 , emitter-to-surface and surface-to-camera) from 
indirect responses (Li > 2, multipath). Together with a 
fixed ambient value r measuring light intensity without 
active illumination, for example from a regular rendering 
pass, the path lengths and weights now permit us to 
simulate a realistic mean response vector p as 


i. 

We remark that the optimization scheme just described 
outperforms all our previous attempts to manually de¬ 
sign the exposure profiles. 

8 Multipath Modeling and Design 

In this section we describe our method for simulating 
realistic multipath images together with ground truth. 
Having a realistic simulation enables several important 
goals: 

- exposure design for reduced multipath artifacts 

- learning/obtaining realistic priors for multipath ef¬ 
fects 

- benchmarking 

We show results for the first and last goal in section 

m 

8.1 Time of Flight Simulation 

In computer graphics physically-accurate Tenderers are 
mature technology that are readily available. We adapt 
the open source Mitsuba Tenderer 03 . Mitsuba supports, 
based on physical modeling of light scattering, light 
transport simulation, integrating paths of light at every 
pixel, thereby producing a highly realistic rendered im¬ 
age. We adapt the code so that we obtain the total light 
path length and the number of segments of the light 
trajectory. 

In more detail, we modify two rendering algorithms, 
the bidirectional path tracer algorithm l49l and the 
Metropolis light transport (MLT) fl50l algorithm; nor¬ 
mally both algorithms are used to render the intensity of 


^ = ( 22 ) 

The sum in the second term approximates the time- 
of-flight integral f R C(t)dv(t), where v is an intensity 
measure over time. The division by d(L) is due to both 
Wi and C containing the distance decay function d(t ); 
see (|5). Once we have p we can optionally simulate 
sensor noise as specified in ([§). We provide details in 
the supplementaries. 

We remark that additional relevant work on light 
transport is considered in [52], 53], published indepen¬ 
dently and concurrently with our work. 


8.2 Simulation Results 

In part (a) of Figure [7] we show a synthetic scene. Part 
(b) shows the ground truth depth map corresponding to 
the scene. We marked three points (A, B and C shown 
in part 7(c)) at which we have different amounts of 


multipath. In parts 7(d) 7(e) and |7(f)| we show the 


depth histograms we obtain from our modified Mit¬ 
suba Tenderer. For every point, the histogram shows 
the distribution of distances travelled by the photons 
integrated at this pixel. This distribution is properly 
weighted to account for both distances and reflectivity 
of materials along the pathes. Furthermore we show the 
distribution of distances travelled over a direct path in 
blue (this essentially corresponds to a delta function), 
and distances travelled over multiple pathes in red. We 
see that at point A (part [7(d)) there is no multipath, while 
at point B (part |7(e)) there is multipath due to the wall. 
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(a) R± response and 500 random 
pixels. 



o 1 ^‘ ; -' 

1 2 3 4 5 

Depth predicted stddev (cm) 

(b) Predicted vs actual depth 
uncertainty. 


We now want to design an exposure profile that will 
be more resistant to multipath. Therefore we should 
measure the loss over responses that also include multi- 
path. We use our realistic simulator for that as follows. 
Given one or multiple 3D scenes and their realistic 
light transport pathes, we sample responses from these 
scenes. Formally, the scenes and the simulator are a more 
complex generative model G. We denote the sampling 
from this complex model by ( R , t ) ^ G, but do keep in 
mind that the model G uses multiple reflectivity values 
and ambient lighting along the pathes to generate the 
response R. Both P and G depend on the design Z 


Fig. 8: Predicted uncertainty versus actual uncertainty; the model is well- 
calibrated in that it accurately predicts depth uncertainty. 


through 0 and d22}, respectively. 


We combine both generative models in a mixture: a 
fraction /? e [0; 1] are samples from our assumed model 
prior p and P, and a fraction 1 — /3 are samples from the 


v;=! 


intensity w/o decay 

==! 


1 


llll 

iff: 


i 

— :-<C ... 


1 



JE; 




Ipi(R)P)] + (1 - P) ® At) ~ G lPi(R)M (23) 



Fig. 9: Sample scenes. Top: exposure profile used. Middle: first re¬ 
sponse image R±. Bottom: inferred depth image using the SP-MLE 
model. The left and middle column are scenes with a far-range design, 
the right column is a scene with a near-range design. The designs were 
obtained using different priors p(t) in (l~9) . 

We may see from the histogram the dominant additional 
path lengths - 30 to 50 cm in this case. Finally, in part 
we show a normalized measure of the percentage 
of intensity integrated from multipath (as opposed to 
intensity integrated from a single direct light path), for 
every pixel in the image. We see that corners and just in 
front of the wall or other vertical surfaces actually return 
more multipath signals than direct path signals. 

8.3 Multipath-Robust Exposure Profile Design 

In the exposure profile design objective ( [19} we take 
two expectations: the first over prior imaging conditions 
(prior p) and the second over the assumed forward 
model (forward model P, equation 0). This indeed 
is the way to minimize the loss when responses come 
from our basic generative model, which does not include 
multipath. 


We see that the design objective ( [19} can, by a simple 
change as in ( [23} , accommodate richer priors over scenes 
and effects such as multipath. We demonstrate this in 
section 19.51 

9 Experimental Results 

We use a prototype camera as shown in Figure [l] In our 
experiments we avoid reference and comparison with 
other depth cameras in terms of noise characteristics and 
variance of depth estimates because the validity of such 
comparison is affected by hardware configurations such 
as power used, field of illumination, resolution, ther¬ 
mal design constraints, and sensor sensitivity. Instead 
we focus on demonstrating the validity of our model, 
inference procedures, and regression approximations. 

Throughout the experiments we will use the abbrevia¬ 
tions SP and TP to refer to the single-path model 0 and 
the two-path model ( [13} , respectively. Depending on the 
inference method we use MAP, MLE, and Bayes, so that 
TP-Bayes for example means the two-path model with 
full Bayesian inference. 

9.1 Sample scenes 

We start with a few sample scenes shown in Figure [9] 
We designed two exposure profiles using two different 
uniform priors on depth p{t) in ([19} . The first prior 
focused on larger depths while the second prior focused 
on smaller ranges. The two left columns show outdoor 
and indoor scenes using the far range exposure profile, 
and the right column shows a scene captured with the 
short range profile. The middle row shows the first 
response image, and the bottom row shows the inferred 
depth (obtained using the regression tree). 

9.2 Accurate Depth Uncertainty 

Next we show that by accurately modeling the noise 
present in the observed response our model is able to 
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(8) R i (b) <T m ie(^) (C) Pmle(R) (d) A mle (^) 


Fig. 10: Posterior inference results under different illuminations and 
albedos. [(i)]First response image, exhibiting varying ambient light levels 
and albecfos, including strong shadows. [(b)] Posterior depth uncertainty 
(cm), higher under either stronger ambient light or lower albedo, [(c)] 
Inferred albedo map, in [0; 1]. [(d)] Inferred ambient illumination map. 


assess its own uncertainty in the inferred depth. To 
demonstrate this we capture 200 frames of a static scene 
as shown in Figure |8(a) and sample 500 pixel locations 
in the shown box. 

Since the camera is static, we can obtain the empirical 
standard deviation of the depth estimators for each of 
the 500 points. We plot this empirical depth uncertainty, 
against the predicted uncertainty obtained in the first 
frame as described in Section |4.2| (the predicted uncer¬ 
tainty is nearly identical over all 200 frames). Figure [8(b)] 
shows the good agreement between the predicted uncer¬ 
tainty and the actual uncertainty. This provides empirical 
data about how well the model is calibrated [54], that 
is, how accurately it judges the uncertainty in its own 
predictions. 

To gain some insight on what determines depth stan¬ 
dard deviation, we turn to Figure [lOj showing a part 
of the scene shown in Figure [T| In Fig. 10 (a)} we see 
one of the input responses, showing the combined effect 
of different albedos and shadows. In Fig. 10(b)| we see 
how imaging conditions affect the variance of the depth 
estimates. In the shadowed regions the ratio between the 
active illumination and ambient light is higher, and this 
generally leads to a tighter posterior. On materials with 
higher albedo (the white page vs the dark circles) the 
amount of reflected light is higher and this also leads 
to smaller variances (as compared with the variances 
on dark circles which reflect less light). In addition, the 
depth itself affects the measure of uncertainty but this is 
not illustrated in this zoomed scene. 


9.3 Ambient and Effective Reflectivity 


In our model, the inferred albedo image is illumination- 
invariant and therefore does not contain shadows. There¬ 
fore we can perform realtime shadow removal [20, [21]/ 
providing illumination-invariant inputs to computer vi¬ 
sion algorithms. This is illustrated in Fig. |10(c)| In 


Fig. |10(d)| we show the estimated ambient light level at 
each pixel. 

For more results on realtime extraction of illumination, 
reflectivity and shape please view the enclosed video. 
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Fig. 11: Regression tree errors compared to full inference. Left: Cumu¬ 
lative error distribution over test set. Right: mean absolute error over 
prior albedo and ambient levels. 


9.4 Regression Tree Approximation Quality 

As discussed we use regression trees to regress depth, 
thus approximating full inference which is infeasible in 
realtime. An optimized implementation running on a 
Intel HD Graphics 4400 GVU, evaluates a regression tree 
of depth 12 with full quadratic polynomials on a 200-by- 
300 pixel frame in 2.5ms. This means we are able to run 
four trees (depth, illumination, albedo and depth std) 
effectively at « lOOfps. The enclosed video shows this 
implementation running (the std was computed but not 
shown in the output windows). 

We now quantify the additional errors incurred due 
to the use of regression trees instead of full inference. 
The added error depends on the tree structure, which 
determines required memory resources as described in 
section [4] We tested two types of trees at three depths, 
yielding six possible tree structures. The two types of 
trees used either a linear polynomial or a quadratic 
polynomial on the leafs. The depths we used were 8,12 
and 16 (full binary trees). 

After training the trees, we generate test data by 
sampling from the prior imaging conditions and our 
generative model Q. We compute the baseline error 
by running full inference (in this case MLE, but similar 
results hold for Bayes) on the test data, then run the 
various tree predictors. Fig. [lT] shows the results. On 
the left we plot the cumulative distribution of errors 
over the test set. On the right we partition the test set 
by depths, and show the mean absolute error for each 
depth, averaged over all albedos and illumination levels. 
We see that as the trees get deeper the quality approaches 
that of the full inference. At 16 levels they essentially 
match. 

9.5 Design for Multipath Robustness 

Now we demonstrate how we may use our simulator 
for obtaining exposure profiles designed to be robust to 
multipath. In section [83] we allowed for more complex 
generative models during exposure profile design, one 
that will generate responses that are "contaminated" 
by multipath. We took two simple scenes of an object 
in front of highly reflective wall (scene provided in 
supplementaries) and used them as the model Q as in 
section [83] For the mixture model we used (3 = 0.5. We 
then ran our design optimization scheme to obtain a new 
exposure profile. Let us call this exposure profile the MP- 
resistant design. We compare it with the standard design 
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Fig. 12: Multipath-robust exposure profile design. Left column: original 
exposure profile. Right column: multipath-optimized exposure pro¬ 
file. From top-to-bottom: Top ; regular exposure profile and robust- 
to-multipath profile. Middle ; bias magnitude (cm) of resulting depth 
inference images. Bottom, Simulated RMSE (cm) including noise. 


We then perform camera mapping using the known cam¬ 
era intrinsics and reconstruct a matching 3D scene for 
our simulator. The synthetic scene allows us to assess 
the agreement between qualitative effects in the real 
capture and in the synthetic image. In our comparison 
we mask the results to the area occupied by the box and 
reflector, and in addition mask the bottom third of the 
sensor array because this particular camera has no active 
illumination design in this region. 

The top row in Figure [13] shows the scene with only 
the box, the bottom row shows the multipath-corrupted 
scene with a large diffuse reflector added. 

The following important observations can be made: 1. 
Comparing each of the four pairs of real and synthetic 
results the qualitative and quantitative error agree be¬ 
tween the actual recording and the simulation; 2. The 
multipath corruption is clearly visible for the single¬ 
path (SP) model in Figure 13(k)| and |13(1)| and to a 


smaller extend in the two-path (TP) models. Figure 13(m) 
and |13(n)l 


Overall the simulation agrees very well with the real 
camera system. We remark that beyond this single ex¬ 
periment we describe here, the simulator is in daily use 
in our group and we have seen excellent agreement 
between simulated results and live tests over many 
months of using it. 


obtained using f3 = 1. The two designs are tested on 
the scene shown in Fig. [7] We emphasize that this test 
scene is different than the scene used in design. Fig. [12] 
compares the results we obtain. The top row shows the 
regular design on the left, and the MP-resistant design 
on the right (obtained using two different values of 
/3 in ([23}). The second row shows the magnitude of 
the resulting depth bias. On the left we see significant 
biases due to multipath (compare with the multipath 
map in Fig. |7(c)}. With the MP-resistant design we see a 
significant reduction in bias. The bottom row shows the 
expected RMSE under both designs. Please note that this 
RMSE contains the jitter in the estimates due to pixel 
shot noise. In temporal integration procedures like [[3) 
this error diminishes, and therefore it is more important 
to have the bias reduced than have this RMSE reduced. 


9.6 Experimental Verification of Simulation Accu¬ 
racy 

Our light transport simulation is based on an accurate 
physical model of light and the simulation results should 
agree with the real camera. However, a real time-of-flight 
camera is a complex system with many components and 
potentially unaccounted for interactions between them. 
In this section we verify that our simulation serves as a 
good proxy for the real system. 

To this end, we take images from two real scenes with 
a box and an optional reflector, shown in Figure 13(a)| 
and 13(h)} keeping the camera static between captures. 


9.7 Robust Model Invalidation 

To demonstrate the performance of the proposed inval¬ 
idation mechanism we use the same real experimental 
data as in the previous section. We compute the 7 score 
for both scenes and for the SP-Bayes and TP-Bayes 
models. 

The results are shown in Figure [14] and show that 
robust invalidation is possible for this scene. In the next 
section we also report 7 score images for simulation data. 


9.8 Benchmarking using Simulation Data 

In this final experiment we leverage the ability of our 
simulator to provide ground truth depth. This allows us 
to assess the depth inference performance quantitatively. 
We use five scenes adapted from blendswap.com for 
this purpose. The depth range in each of these scenes is 
within 50cm to 500cm and the scene surfaces represent 
a good variety in materials and convex and concave 
geometries. 

The results are visualized in Figures [l5] to 16 and 
we report quantitative results for depth reconstruc¬ 
tion in Table [T| Two additional visualizations are 
provided in the supplementary materials. As error 
metric we use the 25/50/75 quantiles of absolute 
depth errors because these approximately correspond 
to easy/medium/difficult surfaces. We mask pixels in 
white for which no direct single-path response is created 
during rendering. These pixels typically correspond to 
either infinite rays or to perfectly specular surfaces. In 
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IRO (measured) 


World scanline plot (cm) 



(a) IRO (measured) 

IRO (measured) 



Absolute error of SP/MAP (cm) 


Absolute error of SP/MAP (cm) 


Absolute error of TP/MAP (cm) 


Absolute error of TP/MAP (cm) 


(b) IRO (syn) (c) Scanline (real) 


World scanline plot (cm) 
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(h) IRO (measured) (i) IRO (syn) (j) Scanline (real) (k) SP err (real) (I) SP err (syn) (m) TP err (real) (n) TP err (syn) 

Fig. 13: Validation of the accuracy of light transport simulation. From the measured IRO frames we use camera mapping to approximately reconstruct 
the 3D scene geometry and surface properties manually. The top row shows the no-multipath setting and we mask the frame so that only the target 
object is shown. The bottom row shows the multipath setting with a large white reflector added to the scene (left side). In the error results, each 
column corresponds to either real errors from the measured IR frames or is entirely simulated. Qualitatively there is an excellent agreement between 
real measurements and synthetic simulations. Small quantitative differences remain, for example in the no-multipath setting with the two-path model. 
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(e) TP error (f) TP 7 (g) TP error (h) TP 7 

Fig. 14: Robust invalidation of measurements that violate our model 
assumptions, as explained in Section [6] In the top row we show the 
single path model SP-Bayes depth infer ence errors on real data as with 
the previous experiment in Section [9^6] whereas the bottom row shows 
the TP-Bayes model. The 7 invalidation score robustly highlights pixels 
affected by multipath in 1(d)] (values close to zero in dark blue) leading 
to large depth errors in](c)] By thresholding the 7 score at a suitably 
chosen threshold these measurements can be excluded from further 
processing. Note that the two-path (TP) model handles multipath and 
the 7 score in[(h)]does not invalidate affected pixels. 


both cases the time-of-flight operating principle does not 
apply. 

We again re-emphasize that the absolute magnitude 
of the errors is not material and incomparable to other 
cameras for two reasons: first we report raw-depth errors 
with absolutely no spatial or temporal filtering that is 
usually present. Second, the jitter error highly depends 
on camera power, sensor size and other hardware char¬ 
acteristics which are of no concern in this paper. 

From the results we make the following observations: 

1 ) Surfaces with low reflectivity have large depth 
errors but the model is aware of this through 
a large inferred a value. For example, the black 


2) 


floor in Figure |15(d) and 15(e)} or the cooking 
stove in Figure |16(e) Improving on these regions 
would require increasing the light output or sensor 
sensitivity. 

Areas affected by strong multipath also have 
large depth errors; for example the ceiling in Fig- 
15(d)| The SP model <j does not indicate a po- 


ure 


tential error, but the 7 score can invalidate these ob- 


3) 


servations, for example the ceiling in Figure 15(f) 


The two-path model (TP) improves depth accuracy 
in every scene but also reports increased model a 
compared to the single-path model. For example, 
compare the absolute errors between Figures |15(d) 


and |15(j)[ and the a maps in Figure [l5(e)| and |l5(k)| 
4) Less invalidation happens in the two-path model. 
In all scenes, the TP-Bayes 7 invalidates less pixels 
compared to the SP-Bayes 7 map, because as a 
model it is a better representation for the physical 
simulator. 


5) In Table [T| the errors are significantly reduced by 
the two-path model, typically by 40 percent. 

These results and insights agree with extensive live 
tests performed in the process of productizing our sys¬ 
tem. 


10 Conclusion 

Our presented approach is based on sound probabilistic 
modelling given our understanding of the physical re¬ 
ality. Bayesian inference naturally provides a powerful 
formal calculus to perform depth inference given our 
modelling assumptions. We have shown that even a sim¬ 
ple model of multipath enables significant reductions in 
the depth error. However, both parts of our approach— 
the prior and model —are general and open to future 
extensions. For the prior we plan to develop scene- and 
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(f) SP-Bayes 7 


. invalidation score TP-BAYES 



(g) iro 


(h) Multipath ratio 


(i) TP-Bayes t 


(j) TP-Bayes error 


(k) TP-Bayes a 


(I) TP-Bayes 7 


Fig. 15: Rendered simulation (scene adapted from “sitting room” by cenobi, licensed CC-BY from blendswap.com). High errors are present due to 
low reflectivity surfaces and multipath; the multipath errors are reduced by the two-path model (ceiling, wall, floor). The uncertainty estimate a is 
higher for the two-path model, reflecting the multipath awareness (compare the & values at the ceiling). 



(a) Scene (b) Ground truth depth (c) SP-Bayes t (d) SP-Bayes error (e) SP-Bayes <7 (f) SP-Bayes 7 



(g) IRO (h) Multipath ratio (i) TP-Bayes t (j) TP-Bayes error (k) TP-Bayes & (I) TP-Bayes 7 

Fig. 16: Rendered simulation (scene adapted from “Country-Kitchen Cycles” by Jay-Artist, licensed CC-BY from blendswap.com). Overall strong 
multipath error reduction across the scene but higher overall single-frame jitter due to the strong ambient lighting. 
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1.60 

3.40 

6.55 

Kitchen Nr 2 

SP-Bayes 

6.01 

10.29 

17.26 

(supp. mat.) 

TP-Bayes 

2.86 

6.04 

12.65 

Country Kitchen 

SP-Bayes 

4.69 

9.63 

17.20 

Fig. 16 

TP-Bayes 

2.75 

5.87 

11.78 

Wooden Staircase 

SP-Bayes 

4.08 

8.59 

14.64 

(supp. mat.) 

TP-Bayes 

2.19 

4.85 

9.51 


TABLE 1: Predictive performance of the Bayesian single-path (SP) and 
two-path (TP) models on realistic data obtained from physically-accurate 
light transport simulation. Across all scenes the 25/50/75 error quantiles 
are significantly reduced by the two-path model, (raw-depth errors - no 
spatial or temporal filtering whatsoever) 


task-specific priors to be able to improve performance in 
the presence of strong multipath and ambient light. We 
envision more refined models of multipath, for example 


by replacing the two-path pulse response by a more 
accurate analytic model of diffuse Lambertian multipath. 
This would require adding further latent variables re¬ 
lated to multipath responses and creating suitable priors 
for them; this may be challenging but our simulation 
framework will likely enable us to make progress in this 
direction in the future. Our statistical view on time-of- 
flight enables all these extensions within a principled 
framework. 
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Appendix A 

Video Demonstration 

We submit a short video showing the robustness of our ap¬ 
proach. In the video we show robust live (real-time) inference 
of depth, reflectance/albedo and illumination. We demonstrate 
the effective separation between illumination and reflectance 
by waving a powerful light projector and noting that the albedo 
and depth map stay invariant to the changing illumination 
conditions. 

Please note that the RGB stream was shot by the person 
holding the light projector - and is unrelated to the depth 
camera stream (we added it for general impression of the scene 
- it is slightly confusing). 

Appendix B 
Inference Details 

For this section we will use the compound parameter vector 
0 — [t,p, A] T , or 0 = [t, p, A, £ 2 , p 2 ] T for the single- and two- 
paths model. This unifies the notation for all unknown imaging 
conditions we would like to infer. 

The response curve function C(t) appearing in the expres¬ 
sion for the mean photon response ft (see equation 0 in 
the main paper), is obtained from calibrated measurements 
of the actual camera, and then approximated by Chebyshev 
polynomials of degree sixteen |55|. Because the curves are 
smooth the Chebyshev approximation is compact yet very 
accurate and evaluation of C(t) also provides the derivatives 
| \C(t) and ^G(t) for no additional computational cost. 


B.1 Maximum Likelihood Estimation (MLE) 

The standard maximum likelihood estimate are the imaging 
conditions t, p, A which maximize the likelihood or equiva¬ 
lently minimize the negative log-likelihood 


argmin - log P(R\0) 
6 


n 

= argmin 

0 i= 1 


(Ri - /^(fl )) 2 
2 (am(6) + K) 


+ 2 l°g(<*Mi(0) + K) 


(24) 


With this Chebyshev polynomial approximation we can 
also compute derivatives with respect to 0 of the log- 
likelihood function, and the entire log-likelihood function be¬ 
comes smooth and twice differentiable. 

Solving the three-dimensional minimization problem in 
Equation p4} with standard Quasi-Newton methods such as 
L-BFGS |[56] is possible but often yields unreasonable result 
if we do not constrain the parameters. For example, negative 
values of p might have the lowest function value but are 
physically impossible. Another issue is that the response curves 
C are measured only within a reasonable range. Outside of this 
range, the Chebyshev approximations have arbitrary behavior 
which leads to implausible solutions. 

We therefore constrain the range of parameters using log- 
barrier terms 


argmin 

e 


E 


2(ap>i(0) + K) 


\ log(am(6) + K) 


T ^ ^ b (log (0j 6j, min) T log($j ; max $ 7 '))- 

3 


(25) 

(26) 


The scalar b — 10 -2 is a barrier coefficient and Omin, Omax 
are the smallest and largest values of each parameter we 
want to consider. The problem remains twice differentiable 
and quasi-Newton methods can be applied for finding local 
minima reliably because any local optima has to occur within 


the relative interior of the rectangle described by 0^ m i n and 

Oj, max- 

To find the global optimum, we restart the quasi-Newton 
method ten times with initialization sampled uniformly in the 
parameter ranges. For producing labeled training data this is 
more than sufficient. Even during exposure profile optimiza¬ 
tion, experiments on good and mediocre shutter designs have 
shown that after 10 restarts in 97% of the cases the same global 
solution was found as with 100 restarts. 


B.2 Maximum A-Posteriori Estimation (MAP) 

The maximum a-posteriori (MAP) estimate is similar to the 
maximum likelihood estimate but also considers the prior 
instead of only the likelihood distribution. We determine the 
estimate by minimizing the negative log posterior 


argmin — log P(0\R) 
e 


(27) 


: argmin 

o i=l 


^ W)2 + l log (am(0) + K) 
2(am(6) + K) 2 


-logp(0). 


Due to the particular choices of twice differentiable prior 
distributions, we can solve this problem right away with quasi- 
Newton methods. The log-barrier terms used for the maximum 
likelihood estimate are now implicitly defined in the prior. 
In fact, the constrained maximum likelihood estimate in the 
previous subsection can be understood as MAP estimate with 
an approximately uniform prior on the ranges 0 m in to Omax- 
The advantage of the MAP estimate is that when prior 
knowledge exists - for example a strong belief on the ambient 
light intensity - then we may incorporate it. In contrast, the 
MLE does not encode any preference for certain parameter 
values. 


B.3 Bayesian Posterior Inference 

The Bayesian point estimate is motivated by statistical decision 
theory |46 1 1571 . The Bayes estimator yields the lowest expected 
error. Assuming the squared loss function, the estimator is 
characterized as 

0 Bayes(-R) := argmin [||0- 0 ||%], (28) 


where 0 are the true but uncertain parameters. 

This decision problem has a closed form solution: namely the 
mean parameters under the marginal posterior distributions. 
Because the squared loss decomposes over parameters, so does 
the decision problem. 

For example, the Bayes estimator £b ayes for depth is given by 

t B ayes(P) = = J tp(t\R)dt. (29) 

The marginal posterior distribution p(t\R) can be written in 
terms of the joint distribution as 

p(t\R) = J p(6\R)dpd\ 

= [ dpdA , 

J p(R ) 

The Bayes estimator tBayes is therefore equal to 


E[t\R] = PPtfrftfVf , 
fp(6)p(R\e) dd 


( 30 ) 
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Appendix C 

Test-time Regression Tree Inference De¬ 
tails 



300 350 400 

Depth (cm) 


Fig. 17: Slice of the posterior distribution of the single-path model for 
fixed values of reflectivity p. In this case the posterior distribution of the 
single-path model is dominated by a single Gaussian-like mode. 


One way of computing the Bayes estimator is solving the 
integrals in the numerator and denominator for all parameters 
that we are interested in. We use a state-of-the-art numerical 
quadrature method (58) for vector-valued integrals over rect¬ 
angular regions. 

However, the numerical quadrature approach is very slow 
and has numerical issues that yield sub-optimal solutions. We 
therefore consider an alternative way to compute the Bayes 
estimators: Monte Carlo using importance sampling [59]. 

We observed that the posterior distributions of the single¬ 
path model are mostly dominated by a few important modes 
that often have symmetric shape, see Figure |17| The posterior 
can therefore be approximated well by a mixture of Gaussians. 
Using importance sampling with a mixture of Gaussians pro¬ 
posal distribution should therefore yield fast convergence to 
the true Bayes estimator. 

The proposal distribution is a mixture of k Gaussians placed 
at the outputs of k local opt ima of the MAP problem obtained 
as described in Section ( |B.2| . The proposal distribution is 

k 

q(9) <x (31) 

i= 1 


where k is the number of mixture components used and 6^ 
are the locations of these mixtures. For the covariance matrices 
H ( fc ) we use the inverse Hessian of the negative log-posterior 
(as in a Laplace approximation). Due to the particular choice 
of twice differentiable priors, the Hessian of the log-posterior 
are always positive definite in local optima. 

We generate samples fji ... ffm from q and re-weight each 
sample by wi = to account for the errors in the approx¬ 
imation of the posterior by q. These samples are then used to 
obtain Monte-Carlo estimates of the integrals in equation < [30) . 

We determine the number of samples required to approxi¬ 
mate the integrals by the effective sample size (ESS) l60ll6ll . 


EM 


(32) 


We stop sampling as soon as the ESS exceeds a threshold 
(usually in the order of 50—200). In most cases this threshold is 
reached with a small number of actual samples. Empirically we 
observe the importance sampling approach to be much faster 
and robust in practice than the numerical quadrature approach. 


The regression trees approach we described in the main pa¬ 
per has an advantage in terms of flexibility. We used this 
flexibility to solve several issues that we encountered during 
development of the prototype camera. While a full description 
of the issues and their seamless solution within this framework 
is beyond the scope of this paper, we want to provide one 
important example. 

One notes that all inference is based on the response curve 
G(-) which characterizes the pixel's response to depth. In the 
physical camera, due to various optical and semiconductor 
effects, this response curve varies between sensor elements, 
and this variation is smooth with the position of the pixel on 
the image sensor. As a result, instead of having a single curve 
C(-) as we described so far, we actually have a set of response 
curves C x , y {f), one for each pixel in the image. Using the 
regression tree framework, we had a simple seamless solution 
for this issue as follows: 

• During training, instead of sampling responses from a 
single curve C(-), we sample responses from multiple 
response curves corresponding to different parts of the 
image. To obtain the label t(Ri), use slow inference with 
the actual (position dependent) curve from which Ri was 
sampled. 

• We augment the feature vector to include pixel position 
in addition to the response R. 

• We extend the leaf model and add linear terms in pixel 
coordinates x and y. 

• Train the regression tree as usual. 

• During runtime: just add pixel position to the feature 
vector used to traverse the tree 

This example serves to show the added benefit of a flexible 
regression mechanism in extending the model to solve new 
and unexpected problems. 


Appendix D 

Model Checking and P-Values 

This discussion provides additional background for our choice 
of the posterior predictive p-value used in the main paper. 
P-values are highly controversial in the field of statistics, in 
particular for formal hypothesis testing. For example, in (44l 
Xiaoli Meng writes about the p-value, 

"There is perhaps no single notion in statistics, other 
than the p-value, that has been so widely used and 
yet so seriously criticized for so long." 

Many works have discussed this controversy and Berger f62l 
provides a nice formal summary of the issues of disagreement. 

For models that are fully observed the choice of a test 
statistic unambiguously defines the p-value. However, our 
model involves unobserved quantities, the imaging conditions 
t, p, and A, as well as £2 and p 2 for the two-path model. 
In this case, there is no single p-value to be used and this 
case is known as "composite null hypothesis" or models with 
"nuisance parameters". Also, in this case the test statistic in 
the classical p-value generally does not depend on the nuisance 
parameter, which is a drawback as it limits the choice of useful 
test statistics. 

The posterior predictive p-value (32l l44l addresses both 
problems by integrating the test statistic over the Bayesian pos¬ 
terior of the unknown variables. As test statistic we choose the 
likelihood of the observation given the unknown parameters, 
yielding equation |l6) in the main paper. 
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It is easier to understand the usefulness of the likelihood as 


a test statistic on a model that is fully observed. We visualize 
such an example with a simple Gaussian mixture model in 
Figure [18] For the case with unobserved variables the situation 
is similar except that the distribution changes as a function of 
the observation R. 


The posterior predictive p- 
value has known drawbacks, 
analyzed in (32 , 451. In particu¬ 
lar it is known that it can be too 
conservative in rejecting the 
null hypothesis. In our appli¬ 
cation this implies that we may 
not detect all detectable devia¬ 
tion from the assumed model. 
Intuitively the reason for this 
lack of power is that the p- 
value is not Bayesian and ac¬ 
tually does use the observation 
twice, once to define the pos¬ 
terior, and once in the compu¬ 
tation that defines the p-value, 
leading to overly optimistic 
agreement with the model. The 
so called partial posterior pre¬ 
dictive p-value (32l does suc- 


GMM PDF and 7 level sets 



Fig. 18: Invalidation score ex¬ 
ample. For a Gaussian mixture 
model level sets at significance 
levels 7 g {0.01,0.05} sepa¬ 
rate the total probability mass 
such that outside the level set 
the integrated probability is 7 . 
Samples outside these sets are 
rejected. 


cessfully address this issue but it is much more difficult to 
compute; in fact, although desirable, we have not found a 
practical method to compute it in our application. 


Appendix E 

Exposure Profiles Design Details 

To optimize the design objective (19) in the main paper, we use 
a simulated annealing app roach as follows. Let us abbreviate 
the objective function | |19) as 

/(Z) = \t,p,x,z) • (33) 

We introduce an auxiliary Gibbs distribution, parametrized by 
a temperature T > 0, 

r(Z,T)(xexp(Ll/(Z)). (34) 

We use a sequence of temperature parameters that is slowly 
decreased for a finite number of steps, that is. To > Ti > 
• • • > Tk, starting from an initial temperature T 0 = T sta rt 
down to a final temperature Tk — Tf ina i. The smaller T gets, 
the more peaked the distribution r(*,T) becomes around the 
minimum of /. Given a Markov chain sampler on r, this 
approach converges to the global minimum of /. 

We first discuss the Markov chain that we use, then give 
details about the temperature schedule. 


E.1 Markov Chain 

To account for the sparsity constraints on Z, our Markov chain 
uses an augmented state space (63) to avoid measure-theoretic 
difficulties of asserting reversibility in the context of changing 
dimensionality (64l l. 

We decompose Z into a binary matrix B G {0, l} mXn and 
a value matrix V G R mXn with Zji — BjiVji. This allows us 
to easily set weights to zero by setting Bji = 0 and have the 
reversible proposal readily available by setting Bji = 1. Our 
MCMC sampler is a reversible Metropolis-Hastings sampler 
and consists of the following transition kernels (moves): 

1) Move mass: Choose two matrix entries Vji, Vki randomly 
(uniform) and move a uniformly sampled value from one 



Fig. 19: Simulated annealing schedule used during shutter profile design 
optimization, here with K = 20,000 iterations. 


entry to another such that their total value stays the same 
and both are still positive. This kernel is reversible with 
itself. 

2) Swap values: Choose two matrix entries Wji, Wji ran¬ 
domly (uniform) and swap their values V and binary 
indicator value B. This kernel is reversible with itself. 

3) Set a weight to zero: Choose a matrix entry with Bji — 
1 randomly (uniform) and set it to zero. This kernel is 
reversible with the following kernel. 

4) Set a weight to nonzero: Choose a matrix entry with Bji = 
0 randomly (uniform) and set its binary indicator value 
to one. This kernel is reversible with the previous set-to- 
zero kernel. 

5) Perturb weight value: Choose a matrix entry Vji = 0 
randomly (uniform) and rescale its value with a log¬ 
normal sampled factor. This kernel is reversible with 
itself. 

6) Scale all weight values: Rescale all values V with a log¬ 
normal sampled scalar. This kernel is reversible with 
itself. 

The above kernels are combined with the following proba¬ 
bilities: 20% for the move mass kernel; 20% for the swap values 
kernel; 10% for the set-to-zero and set-to-nonzero kernels, each; 
30% for the perturb weight kernel; 10% for the global scaling 
kernel. 

E.2 Temperature Schedule 

For simulated annealing we use a geometric temperature 
schedule m, with the temperature at iteration k being 

Tk — T s tart ft , 

where we use the initial temperature T sta rt = 20 and a target 
temperature of T^ai = 0.01, so that 

ft — exp ^ — [log Tfinai log Tfinai] ^ . 

This leads to the schedule as shown in Figure [19] We typically 
use a K — 20, 000 or K — 100, 000 iterations. 

Appendix F 

Time of Flight Simulation Details 

We now discuss details of the physically accurate light sim¬ 
ulation that we use to simulate multipath phenomena. First 
we recap the basis of both the bidirectional path tracer (BDPT) 
and the Metropolis light transport (MLT) algorithms t49l l50l 
and then provide information about the variance reduction 
techniques we use. 
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F.1 Light Transport Formulation 

Assuming a geometric light model where light travels in 
staight lines and only interacts with surfaces, the measured 
light intensity at a pixel in a static scene without active 
illumination can be formulated as a path integral. This integral 
accumulates the intensity from light paths xo,x± .. .Xk+i that 
start in a point xo on an emitting surface and end in a point 
Xk+i on the pixel's sensor surface. The intermediate nodes of 
this path xi.. .Xk are surfaces in the scene. The integral can 
be formulated (see |50| for details) as 


OO r. 


J\A k + 1 


L e (xo —y x\)G(Xq x\) 


k 

Q (f(xi-i Xi x i+ i)G(xi x i+ 1 )) 

i= 1 


L s (x k Xk+i) &A(xq) ... dA(x k +i). 


(35) 


In this equation, 

• M is the set of all surfaces in the scene including emitters 
and the camera sensor and A is the area measure on M; 

• L e (x o —>> xi) is a function representing emitters. It is 
proportional to the light that is emitted from point xo in 
the direction of x\. It takes only non-zero values if xo is 
on emitter surfaces; 

• L s (xk Xk+i) is the equivalent of L e for the sensor. L s 
specifies how sensitive the sensor is for photons arriving 
at Xk+i from the direction of Xk- 

• f(xi -1 —> Xi Xi+ 1 ) is the bidirectional scattering 

distribution function (BSDF) describing how much light 
is scattered at surface point xi in direction Xi+i of an 
incoming ray from the direction of Xi-i; 

• G(xi -H- Xi+ 1 ) = V(xi -w- X i+ i) 1 ^yt 11 is the 

throughput of a differential beam between d A(xi) and 
dA(xi+i). V(xi Xi+ 1 ) is an indicator function for mu¬ 

tual visibility of Xi and £i+i, which means V is zero if the 
direct path between the two inputs is blocked, otherwise 
1; The variables fa, &+1 denote the angle between the 
beam and the surface normals at Xi and Xi+±. 

The observed response in a specific pixel of our time-of- 
flight camera from the emitted light pulse can be modelled by 
extending the path integral formulation above to 

-^active — / ^ / P(ll)L e (xo Y X0 G(Xq •<->• Xi) 

J 

k 

PI 1 Xi -> x i+1 )G(xi ++ X i+1 )) 


L s (xk —y Xk+i)Sj(u + ti) 
dA(xo )... &A(xk+i)&u. 


(36) 


We additionally integrate over time u and include the intensity 
of the emitted pulse P(t) as well as the shutter function Sj (t + 
t{). The time delay ti = cl of emitted light arriving at the sensor 
is the total path length, 

1 = Eh- 1 -Xi ii’ 


times the speed of light c. All terms involving time can be 
group together into the expression 

j P(u)Sj{u + ti)Au= ^0^- 

that only depends on the time delay ti corresponding to total 
path length. It corresponds to the curve Cj without the decay 
of light d(ti) due to distance l (The decay of light is already 


accounted for in the G terms of the integral). The measured 
response is then 

inactive = E f ^ Xi)G(X 0 *l) 

£r' 0 JM k + 1 d W 

k 

Y\ 1 ^ x i X i+ i )G(Xi ++ X i+1 )) 

i =1 

Ls{Xk y Xk-\- 1) 

dA(xo )... dA(xk+i)- (37) 

This formulation is identical to the path integral Equation < [35} 
but with the additional Cj(ti)/d(ti) term. 

We modified the bidirectional path tracer (BDPT) algo¬ 
rithm [491 and the Metropolis light transport (MLT) algo¬ 
rithm f50l in the Mitsuba Tenderer |48| to produce a weighted 
set of samples {(wi, Li,ti)}i=i,...,N of the path integral in 
Equation ( [35} . The weight of the path sample is wi, Li is the 
number ofTages and L is the time corresponding to the total 
path distance. We can generate samples of inactive by 


V^ w i 

hi d 


Cj (ti 


Considering all shutters Ci, C 2 ,... and adding constant am¬ 
bient light r to account for i4mbient/ we may obtain realistic 
estimates of the mean response vector 

N 

(38) 


F.2 Variance Reduction 

The BDPT and MLT rendering techniques are Monte Carlo 
methods and therefore estimates obtained from them will have 
a Monte Carlo variance. This variance does not originate with 
the underlying mechanism that is being simulated, but is due 
to the finite number of samples that are used for estimation. 
Whereas normal light transport rendering in computer graph¬ 
ics applications is targeted at estimating mean intensities in 
three spectral bands (RGB), we are instead interested in the 
time-of-flight density. Because it is a function instead of a small 
number of values, it is more difficult to obtain reliable estimates 
of this function. 

To improve the accuracy of our estimate with the given 
time and memory constraints, we use two variance reduction 
techniques: stratification and priority sampling. 

The starting point for both methods is a stream of weighted 
samples (u>i,Li,ti) being generated for each pixel. 

F.2.1 Stratification 

Stratification is a classic variance reduction technique based on 
prior knowledge of subpopulations which have lower within- 
population variation. It works by breaking up the estimation 
problem into one estimation problem per subpopulation and 
combining the individual estimates into one joint estimate. This 
reduces the variance of the joint estimate compared to lumping 
all subpopulations together in only one population and sam¬ 
pling and estimating from only this one population l59l Section 
5.5]. 

We stratify the incoming stream of samples into two sets. 
The first stratum is the set of samples Li — 2 and the second 
stratum is the set of samples for which Li > 2. For both sets we 
keep an equal number of samples, typically a few thousand. 
The following priority sampling is then performed on each of 
these two sets separately. 
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Fig. 20: Validation of the noise model ( 3 ). The variance of the response 
is approximately a linear function of the mean response [37]. We show 
samples from four responses in different colors and superimpose the 
least squares fit. 


F.2.2 Priority Sampling 

The output of the simulation is a set of path samples (wi , Li, U) 
for each pixel. For large image sizes this can require tens of 
gigabytes of storage. In particular for the MLT sampler many 
of these samples contain partially redundant information due 
to correlated sampling via runnign a Markov chain, and storing 
all of them is wasteful in terms of storage. Because for MLT the 
samples are correlated in time, one really does need to generate 
a large number of samples to get good estimates; it is only the 
storage that is wasteful. For BDPT the samples are uncorrelated 
but we can still improve estimates by replacing low-weight 
samples with more important ones due to the inefficiencies of 
importance sampling. 

Typically, in general Markov chain Monte Carlo simulations 
the samples are unweighted and one can simply thin the 
samples by taking, for example, every 10th sample only, or 
by using reservoir sampling to keep a random subset. Here, 
however, the samples from both BDPT (importance sampling) 
and MLT (Markov chain simulations) are weighted, and this 
naive strategy—while still valid in terms of providing an 
unbiased estimate—yields a high variance estimate because it 
discards important samples with high weights. 

To obtain low-variance estimates from few samples we use 
priority sampling [65]|, a close to optimal method addressing 
the above subsampling problem. Intuitively, priority sampling 
generalizes reservoir sampling to the case of weighted samples. 
It processes the input sample stream one sample at a time and 
keeps a fixed number of samples with adjusted weights. The 
weights are adjusted such that the estimate of any subset sum 
is unbiased, and the variance of weight subset sums is almost 
optimal uniformly over all possible subsets. 

We use priority sampling to thin the two sample streams for 
each stratum and after rendering is finished we simply output 
the kept samples and adjusted weights. 

Overall we found that the bidirectional path tracer (BDPT) 
often produces better results with lower variance and all 
simulation results in the main paper are obtained by running 
BDPT with 8192 samples per pixel. 

Appendix G 

Noise Model Validation 

To verify the noise model we assumed in Section [ 3 ] (equation [31 
of the mai n pa per, we used the experimental setup as described 
in Section [972] of the main paper: we sample 500 random pixels 
and capture zOO frames from a static scene. We then measure 
the variance of each pixel's response, as well as estimate the 
mean response; this provides empirical data about the actual 
noise present in the input signal. 



Fig. 21: The scene used for designing a multipath-resistant exposure 
profile. 


Figure [20] shows the noise model results in the form of a 
scatter plot of the variance of responses versus their mean. 
The data clearly validates the assumed noise model (|8j from 
the main paper, and shows that the signal-dependent roisso- 
nian shot noise component dominates except for very small 
intensities. 


Appendix H 

The scene used as an extended genera¬ 
tive MODEL 


As described in Section 8.3 in the main paper, we may use 
a realistic light transport simulation as a more complex gen¬ 
erative model Q in order to generate responses containing 
also multipath components. We used this method to design 
a multipath-resistant exposure profile. The scene we used 
is depicted in Figure [2l| The camera is pointed towards a 
reflective wall, and the responses sampled from the cylinder 
and the floor contain both a direct component and multipath 
components. We used two copies of this scene at two different 
scales. The exposure profile designed using this scene was 


tested on a very different test scene as described in Section 9.5 


Appendix I 

Additional Rendered Results 


In Section |9.8| of the main paper we omitted three scenes 


for space reasons; the results are provided here in Figure 22 
Figure [23] and Figure |24]and qualitatively agree with the results 
shown m the main paper. 


1.1 Example Problem with 7 

We now give an example where we suffer high depth errors 
but have no operational method to recognize this: the estimated 
uncertainty <7 is not overly large, and the invalidation score 7 
does not indicate deviation from the model. 

The situation occurs in the scene shown in Figure [23j and we 
highlight the area in Figure [25] In essence the problem is due 
to complex multipath phenomena involving diff use mu ltipath 
and multiple bounces as can be seen from Figure |25(b)| which 
leads to an observed response that is well within a high- 
probability region of the assumed two-path model (no inval¬ 
idation) and has a strong direct response component (leading 
to low < 7 ). 
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(d) SP-Bayes error 


(e) SP-Bayes a 
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(f) SP-Bayes 7 


(g) IRO (h) Multipath ratio (i) TP-Bayes t (j) TP-Bayes error (k) TP-Bayes & (I) TP-Bayes 7 

Fig. 22: Rendered simulation (scene adapted from “The Breakfast Room” by Wig42, licensed CC-BY from blendswap.com). Significant multipath 
error reductions due to the two-path model are visible (wall, table, floor, chairs). Specular surfaces (lamp shade) remain problematic. 




(a) Scene 



(b) Ground truth depth (c) SP-Bayes t (d) SP-Bayes error (e) SP-Bayes a 



(g) IRO (h) Multipath ratio (i) TP-Bayes t (j) TP-Bayes error (k) TP-Bayes a 

Fig. 23: Rendered simulation (scene adapted from “Kitchen Nr 2” by oldtime, licensed CC-BY from blendswap.com). 



(I) TP-Bayes 7 




(f) SP-Bayes 7 



(a) Scene (b) Ground truth depth (c) SP-Bayes t (d) SP-Bayes error (e) SP-Bayes <7 (f) SP-Bayes 7 



(g) IRO (h) Multipath ratio (i) TP-Bayes t (j) TP-Bayes error (k) TP-Bayes & (I) TP-Bayes 7 

Fig. 24: Rendered simulation (scene adapted from “The Wooden Staircase” by Wig42, licensed CC-BY from blendswap.com). 


In order to improve depth accuracy in regions such as 
the highlighted one, several approaches are relevant. We are 
currently considering temporal integration of the observed 
response and an imaging model that does not assume con¬ 
ditional independence among different sensor elements. For 


example, many surfaces are planar and recognizing deviations 
from planarity over multiple pixels could potentially provide a 
strong cue to recognize and correct for multipath interference. 
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Multi-Bounce Intensity 



100 200 300 400 



(a) Problematic area wit h high 
depth error (see Fig. |23(j)| 
that is not rec ognized by £ 
(se e Fig. |23(k)[ , nor by 7 (see 


Fig-|23(T) 


(b) Intensity measure at the high¬ 
lighted pixel showing complex multi- 
path phenomena with more than two- 
path responses and large diffuse mul¬ 
tipath. 


Fig. 25: Explaining a failure of a and 7 to recognize areas of large depth 
errors. 
















